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Abstract 

Parallel computation offers the potential for quickly solving large computational problems. 
However, it is often a non- trivial task to effectively use parallel computers. Solution methods 
must sometimes be reformulated to exploit parallelism; the reformulations are often more com- 
plex than their slower serial counterparts. We illustrate these points by studying the paralleliza- 
tion of sparse one-dimensional dynamic programming problems, those which do not obviously 
admit substantial parallelization. We propose a new method for parallelizing such problems, 
develop analytic models which help us to identify problems which parallelize well, and compare 
the performance of our algorithm with existing algorithms on a multiprocessor. 




‘Research was supported under the National Aeronautics and Space Administration under NASA Contract No. 
NAS1-18605 while the author was in residence at the Institute for Computer Applications in Science and Engineering 
(ICASE), NASA Langley Research Center, Hampton, VA 23665. 


I 


i 



The availability of commercial parallel computers offers the potential for quickly solving large 
computational problems. However, to exploit parallelism it may be necessary to reformulate the 
solution algorithm; the construction of effective parallel algorithms is still an art in its infancy. 
Parallel solutions may be more complex than their serial counterparts, and may rely on insights 
not generally called for in serial algorithms. 

Consider the following general one-dimensional dynamic programming problem: find the solu- 
tions to the functional equation 

V(l) = 0 

(1) V(j) = min {C(iJ) + V(i)} for j = 2,. . . ,n, 

« e N(j) 

where for each j, N(j) is a non-empty subset of {1,2,. . . , j - 1}. It can be useful to view this prob- 
lem as that of finding the length of the shortest path from node 1 to every other node in an acyclic 
graph, where i £ N(j) implies the existence of a directed edge from i to j , i < j , weighted by 
C(i, j). This formulation is extremely general, since the nodes of any directed acyclic graph may 
be so ordered. If instead the nodes are numbered so that the edges go from higher to lower nodes, 
this formulation arises in classic problems such as optimal production scheduling (with or without 
backlogging) when holding costs are concave^. 

We are interested in solving this problem on a medium-scale multiprocessor, where the size 
of the graph is significantly larger than the number of processors. An obvious parallel solution 
method, the vertical method, computes the V(j) values serially, but employs P multiple processors 
to compute the edge sums specified on the right-hand-side of (l). The set N(j ) is partitioned among 
the processors, who each find the index optimizing C(i,j ) + V(i) among their assigned indices. 
Finally, the processors cooperatively compute V(j) in O(logP) steps. This scheme can effectively 
exploit parallelism, provided there is a large amount of necessary parallelizable computation. This 
condition is violated if the average cardinality of the N(j ) sets is small, or equivalently, the average 
indegree of a node is small. In this situation, relatively little parallel work is performed between each 
synchronization, and its attendant overhead. The vertical approach exploits intra- node parallelism; 
sparse graphs with relatively few edges tend to have little intra-node parallelism. We propose and 
analyze a method which exploits inter- node parallelism, where a multiprocessor works concurrently 
on the solution of multiple V(j) values. We call this type of approach horizontal 
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This work is an outgrowth of earlier workI 7 J where we considered the parallel solution of equa- 
tions 


V(l) = 1 

v(j) = C U) + min {^(0) for j = 

*(i) < * < i 

where z(j) is nondecreasing in j. This type of equation arises in the context of mapping compu- 
tations onto parallel computers^. Even though the indegree of a node may be large, the vertical 
approach is inefficient relative to the optimal serial solution. An efficient serial solution maintains 
a priority heap of previously computed V(j ) values, allowing it to find the min term for a given j 
in O(logn) time. By contrast, the vertical method requires 0(n/P + logP) time to find the same 
quantity. We proposed a horizontal method which computes P min terms optimistically each step, 
meaning that if V(i) and V(j) (i < j ) are computed concurrently in a step, then V(j) assumes 
that its min term is not defined by V(i) (the value of which is not yet known). A simple check 
determines whether the optimism is warranted — if F(j)’s min term is defined by V(i ), then the 
function values for that step are recomputed serially. The probability of serialization is 0(1/ P) if 
j - i{j) is 0(P 2 ) and the C{j) values are independent and identically distributed random variables. 
Our algorithm is based upon this approach, and we will derive similar results for it. 

Bertsekas and Tsitsiklis (1989) explain other parallel dynamic programming algorithms^. These 
algorithms are iterative, requiring multiple passes over the node set. Our algorithms solve the prob- 
lem in a single pass, but use more global synchronizations. We present empirical results showing 
our methods to be superior on the problems we study. 

This study relies on random directed graphs in order to characterize general properties of 
problems, and algorithm behavior. These graphs are described by two parameters, a (constant) 
expected indegree, and the size of an interval from which a node’s incoming edges may be taken. 
All of our experiments and analyses assume this model. 

The paper is organized as follows. §2 discusses our methods for generating random graphs. §3 
describes the block window algorithm, while §4 derives a performance model, and identifies classes 
of graphs which support the algorithm’s approach. §5 describes our computational experience on 
a sixteen processor shared memory architecture. We find that our algorithm outperforms both the 
vertical and iterative methods, achieving good speedups for a variety of problem types. §6 presents 
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our conclusions. 


1 Random Graph Generation 

Before discussing the algorithm, we first describe the methods used to generate random graphs. 
Our experiments and analysis concern graphs whose edges and weights are created randomly. We 
view a graph as a linear sequence of nodes 1 , ,n, where directed edges feed forward from lower 
to higher numbered nodes. We study random graphs that are connected, where an edge (i,j) is 
constrained by j — i < w for some w, and where, whenever i,j > w , nodes i and j have the same 
expected number of incoming edges. To construct such graphs we employ a node interval length 
w and an indegree parameter D < w, both user defined. For each node j we first create an initial 
edge from some node i in the interval [max{l,j — w},j — 1], chosen uniformly at random. This 
ensures that every node is reachable from node 1. We then compute an edge probability 

Pi = min f ~ , , l} • 
lmin{w,j - 1} - 1 J 

Pj takes value 1 when j < D. pj is constant for j > w , we denote that constant 

Pe = {D — 1 )/{w - 1). 

For each potential edge from nodes in j' s interval (other than the initial edge) we perform a Bernoulli 
trial, accepting the edge with probability pj. The expected indegree of j is D when j > D, and is 
j — 1 otherwise. Random edge weights are constructed using a variety of methods described in §4. 

2 Block Window Algorithm 

The block window algorithm partitions the graph into blocks P nodes wide. The algorithm slides 
a window from block to block, from left to right. When the window is positioned over a block, 
the V values for nodes within the block are computed in parallel. When all the block’s V values 
have been correctly computed the window is repositioned over the next block. Figure 1 depicts a 
partitioned graph, and the assignment of processors within a block. We label the nodes within the 
window by w\ through wp. Processor j is responsible for computing V(wj). 


3 



Figure 1: Partitioning a graph into blocks 


Consider the computation associated with a node wj in the window: compute 

V(wj)= min {C(i,Wj) + F(i)}. 

(i,w } ) € N(w,) 

For each Wj we take N(wj) = Ni(wj) U No(v>j), the union of edges rooted inside and outside the 
window, respectively. Block processing has two phases. In a first phase, each processor j computes 

uj = min {C(i,Wj) + F(i)}, 

€ No(wj) 

in parallel with the others, assigns V(wj ) = u j, and then synchronizes globally. With this assign- 
ment processor j is optimistically assuming that V(wj ) is not defined by a V value within the 
current block. If all processors are correct in this assumption, then the V values computed for this 
block are correct. The second phase determines whether any such V value is incorrect. For every 
edge (wi,Wj) € Nj(wj ) processor j compares C(w{, Wj) + u,- with Uj. If any such sum is strictly less 
than Uj we do not have Uj = V(wj). Following this test, the processors again synchronize globally, 
each one passing a flag indicating whether Uj ^ V(wj). Let j(min) be the smallest index of a pro- 
cessor indicating this inequality; define _?(mi n ) = n + 1 in the complete absence of the inequality. If 
j(min) < P, then values VXu>j(min))> ••• V(wp) are recomputed serially. The algorithm’s correctness 
rests on the following proposition. 
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Proposition 1 Let k 6 [1,P]. If 


Uj < min {C(w{,Wj) + w;}, 

e Ni(wj) 

for all j = 1, . . . , k — 1, then V(wj) = Uj for j = 1, . . . , k. 

Proof: AT/(itfi) is empty, so that F(uq) = wj. Suppose then that for some j < k, V(w{) = u* for 
i = 1, . . . , j - 1. Now 

uj = min {C(«,u?j) + V(i)} 

(*•«;>) e N 0 (wj) 

= min < min {C(i,wj) + V(i)}, min {C(u)i, wj) + u*} 

L (hu>})€N 0 (wj) (u>i,wj) e Nj(wj) 

= min <1 min {C(i,u?j) + V'(z)}, min {C(w{, wj) + V(i)} 

l e No(u> 3 ) ( w{ , wj ) € Nj ( wj ) 

= min {C(i,Wj) + V(i)} 

€ W(u'j) 

= V(i + 1). 

The conclusion follows by repeated application of this argument to j — 2, . . . , k. 

□ 

The algorithm is described in pseudo-code in figure 2. The parallel synchronization routine 
CheckSerial identifies j(m\n ), and computes V(wj^ n ))^ . . . ,V(w n ) serially. 

It is worthwhile pointing out that the the block window method in no way requires the problem 
of interest to be a minimization problem. We could equally well use it to solve equations having 
the general form of (1), but with a max operator. This observation increases the applicability of 
the algorithm to classic problems such as the one- resource integer knapsack problem^. 

The section to follow constructs a performance model for this algorithm, and identifies a class 
of graphs on which the second phase serializes with low probability. 

3 Analysis 

To achieve good performance the parallel algorithm must balance the workload well, avoid undue 
overhead, and avoid serialization in the second phase. In this section we construct a performance 
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BlockWindow(pid) 

/* pid is processor id, from 1 to P */ 

int pid; 

{ int Uj , wo,Wj, correct; 

/* wo is node “before” window */ 

w 0 = 0; 
while (icq < n) 

/* compute node index */ 

{ wj — wo + pid; 

/* first phase */ 

= minimum value of C(i,Wj) + V(i) over all i £ No(wj ); 

/* store optimistic solution, synchronize — */ 

V(wj) = uj ; 

/* synchronize globally */ 

Barrier(pid); 

/* begin second phase — */ 

If (C(i, u>j) + V'(i) < uj for any i £ Ni(wj) ) then correct = 0; 
else correct = 1; 

/* synchronize and correct values — */ 

Synch2: CheckSerial( pid, Wj, correct); 
w0 = wO + P] 

} 

} 


Figure 2: Pseudo-code for block window algorithm 

model for the algorithm, and show that if serialization is infrequent or if the indegree is moderately 
large we should achieve good speedups. We then show that if edge weights are independent and 
identically distributed (i.i.d.) and w is 0(P 3 ), then the probability of serialization is only 0(1/P). 

3.1 Performance Model 

The time required to process a block is determined by the processor with the heaviest workload. Our 
first task is to estimate the expected load on the most heavily loaded processor in the first phase, 
and then in the second phase. We can compare the sum of those loads to “perfectly balanced” 
loads to determine the extent of load imbalance. For simplicity, our analysis ignores the startup 
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anomalies, and only considers blocks sufficiently far into the graph, where the average indegree 
of each node is D . Let Oj be the cardinality of No(wj)- Due to the graph construction method 
the random variables 0i,...,0p are independent. However, they are not identically distributed; 
consider the construction of N(wj). The initial edge is in Ni(wj) with probability (j — l)/w. In 
this case, each of the possible w — j + 1 edges for No(wj) has probability p e of being chosen, and 
Oj has a binomial distribution, B(w — j + 1 ,p e )- If the initial edge is in No{wj) then we know 
that Oj has the distribution of one plus a B(w — j,p € ) random variable. Consequently, Oj has the 
mixed distribution 

(2) Distribution of Oj = (^r) B(w - j + 1 ,p e ) + (— ,£ —) (1 + B{w - j,p e )) . 

Now let Ij be the cardinality of Ni(wj). The distribution of Ij is also found by conditioning on 
where the initial edge lies, and so is found to be the mixture 

(3) Distribution of Ij = B(j - l,p e ) + (^-) (1 + B(j - 2,p e )). 

It is straightforward to estimate the expected values of Xo = maxj{Oj}, and X/ — maxj{Jj}. 
For example, let 

Fj(s) = Pr{0 3 <s } 

= Pr{B(w - j + 1, Pe) < ^} + Pr i B ( W ~ j'Pe) ^ 5 - !}• 

Now, Xo < s if and only if Oj < s for all j. Consider the identity: E[Y ] = IZfLo P r {Y > f° r 
any discrete nonnegative random variable Y. Using it we find that 

00 CO / P 

E[Xo ] = E Pr i x o > *} = E 1 " II W 

s = 0 s=0 \ j—l 

An identical approach allows us to compute E[Xj ]. For computational reasons we will ap- 
proximate the binomial distributions involved with Poisson distributions (this is a reasonable 
approximation^). A B(x,r) random variable is approximately equal to the number of Poisson 
arrivals in time [0, x] 7 where r is the arrival rate. We can view E[X o] as the expected time to 
complete the first phase, measured in units where the time required to process one edge takes one 
time unit. E[Xj\ is an estimate of the time needed to complete the second phase, if no serializa- 
tion occurs. E[Xj\ is not exact because the knowledge that no serialization occurs alters these 
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distributions. If serialization does occur, we estimate the number of edges processed serially by 
S = Lj Ij]. Again, this is not exact because knowledge that serialization occurs alters the 

distributions. It is also not exact because our implementation does not recompute all inter-block 
edges once the need for serialization is detected. With careful programming we can reduce the 
number of edge sums computed serially by ignoring those already known to be larger than their 
corresponding uj values. If / is the fraction of blocks requiring serialization, we estimate the time 
required to complete the second phase (ignoring the synchronization cost) by 

E[T I (f)] = (l-f)E[X I ] + fS. 

The sum E[Xo] + E[Tj(f )] is a measure of the time required to process a block (again, ignoring 
synchronization delays). 

The expected value of N(j ) is just D\ the expected number of edges processed in a block is 
consequently PD . If we could assign the load to processors perfectly, every processor would process, 
on average, D edges. A natural measure of load imbalance given the probability of serialization is 
&(/), the balance efficiency : 

6(/) = E{X 0 \ + E[Ti(f)Y 

Figure 3 plots the maximal and minimal balance efficiencies (take / = 0 and / = 1) as a 
function of w, for fixed D = 4, D = 16, and D = 256, and sixteen processors. We see that good 
efficiencies can be achieved even when / = 0, especially when D or w are large. In these cases the 
number of edges from outside the block is so large relative to those in the block that serialization 
scarcely affects performance. Poor load balance occurs if / is high and both w and D are small. 
§3.2 will show the dependence of / on the behavior of the edge weights. 

Load imbalance is only once source of performance degradation. Parallel algorithms frequently 
have a higher overhead per unit of computation than do their serial counterparts. We suppose that 
a serial solution requires a units of time per edge processing, while the parallel solution requires 
(3 > a units of time per edge. We call a//? the overhead efficiency . Both a and ,3 are problem 
sensitive; for example, they depend on the amortization of loop bounds checking over the number 
of loop iterations. The overhead efficiency can be estimated by dividing the execution time of the 
parallel algorithm running on one processor by the execution time of the specifically serial version. 
For the problems we studied the overhead efficiencies range from 0.5 to 0.83. 
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There is also a delay cost associated with executing the global synchronization; let G(P) be 
the time required for P processors to perform the global synchronization between windows. Our 
algorithm uses two synchronizations per block. The overall efficiency of the parallel solution is the 
speedup divided by the number of processors used, and so is given by 

Optimized Serial Time aPD 


Efficiency = 


Processors x Parallel Time 


( 4 ) 


P((3(E[Xo] + E{Tj(f)]) + 2G(P)) 

WW) 

1 + 2 G(P)/(l3(E[Xo} + E[T I (f)}y 


It is important to note that true speedup measurements use an optimal, specifically serial program. 

It is well-known that a barrier synchronization can be performed in O(logP) time. However, on 
the architecture we used the fastest barrier algorithm has a linear costW. We consequently use the 
model G(P) = 7 P, for some 7 . Measured in units where a = 1 , we have found 7 « 0.5. Our test 
suite of problems consists of four replicates each of problems where D = 8,16,32,64,96,128, and 
w = 64 and w — 256. On this suite the average magnitude of the relative error between measured 
speedup for up to eight processors, and that predicted by (4) is approximately five percent. These 
figures used measured frequencies of serialization. The performance model sometimes overestimates 
speedup for sixteen processors by as much as 33%, although the absolute error never exceeded 1.2. 
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This error seems to be due primarily to an under-estimation of <2(16); although it is also possible 
that f3 increases in P, a phenomenon unaccounted for by the model. 

Computer manufacturers are now planning architectures with fast hardware support for global 
synchronization. It is reasonable to ask how good the performance would be if synchronization costs 
were negligible. Equation (4) helps answer this question. If G(P ) vanishes the overall efficiency is 
just the product of the load and overhead efficiencies. We have already seen that load efficiencies 
are often good, at least under the assumed analytic model. The overhead efficiencies are between 
50% and 83%. The product of load and overhead efficiences (assuming sixteen processors) can 
exceed 50%, yielding a speedup of at least 8. Of course, it can be much lower. §4 will show that 
speedups depend primarily on D; with D = 128 and w — 256 we do achieve 50% efficiencies on 
sixteen processors. 

3.2 Probability of Serialization 

We now turn to a more qualitative analysis of the algorithm, where we focus on the behavior of 
V as a random function. To highlight the difference between -the length of shortest path to j on 
a given problem, and the random length of the shortest path to j under our model, we will use 
V(j) to denote the random variable. Likewise, we will use C(i, j) to describe the random weight on 
extant edge (i,j). 

We first show that if the edge weight random variables increase stochastically in their length, 
then V(j) increases stochastically in j . We then show that if the edge weights are Li.d or if the 
edge weights are identically distributed and all edge weights into a node are identical, and if w is 
0(P 3 ), then the probability of serialization is 0(1/ P). Surprisingly, the data in figure 3 show that 
this is a secondary concern, at least when D is not small. Nevertheless, it is an important question 
when D is small and the time spent in the second phase contributes significantly to the overall 
execution time. 

Our understanding of V(i) as the length of the optimal path from node 1 to i suggests that 
V(z) should tend to increase in i, at least if edge weights tend to increase in their length. Our first 
result affirms this intuition by using order relations between random variables. Principally, random 
variable X is said to be stochastically larger than random variable Y if for all a, Pr{X > a} > 
Pr{Y > a}. This is denoted X > st F, or Y < s t X . An equivalent definition says that Y < st X if 
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there exists a random variable Z distributed identically as P, with the property that Z < X. A 
common way to show that Y < st X is to construct a variable Z “based on” X, show that Y and 
Z are identically distributed, and show that Z < X. Called coupling , this will be our approach to 
showing that when V(i— 1) < s t V(t) when i > w and edge weights increase stochastically in length. 

Proposition 2 Suppose that C(i,j — 1) < s t C(i,j) whenever i < j — 1, and that for all j, C(i,j) is 
independent of V(i). Then V{i -*1) < s t V(i) for i = n/-f 

Proof: We use a coupling argument as described above. Define N(j — 1), given iV(j), as follows. 
Whenever (i,j) E N(j) and i ^ j- 1, then place (i,j — 1) E 7V( j— 1). If (j— 1, j) E then define 

(j — l — w^j — 1) £ N(j — 1). Because the edge weights increase stochastically in length, whenever 
i < j — 1 and (i, j) E iV(j) has a weight c, we may chose a weight c < c for s corresponding 
edge in iV(j - 1). If (j - 1 - w,j — 1) E N(j - 1) we may weight it with an arbitrary sample from 
C(j — 1 — w,j — l)’s distribution. Let Z be the length of the shortest path from node 1 to node 
j — 1. Z has V(j — l)’s distribution, because the choice of arcs into j — 1 is driven by the appropriate 
distributions, and the choice of edge weights are also from the appropriate distributions. We must 
now show that Z < V(j). 

Let H be the sum of all edge weights in the graph. For every i and j with i < j define 

B i _ { C(i, j) + V(») if (ij) € N(j) ' 

* \ H otherwise 

By construction Bj > Bj -1 for all i G [j — w,j — 2]. Now define 

M = min {B{}, and M = min {2?/ -1 } 

j — w < i < j — 2 j — w < i < j — 2 

It follows that M > M. Now 

V(j) = min{M,B 3 i _ l } > 


Consequently, V(j - 1) < s t V(j). 

□ 


min {M, minl^jlj.^,, 
min{5jlJ_ w ,M} 

Z. 
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The proposition above is also true in the case of a maximization problem; the proof above works 
by replacing all min’s with max’s. 

Two special cases are of particular interest. One occurs when the edge weights are independent 
and identically distributed, a degenerate case of C(i,j - 1) < st A second case occurs 

when edges are identically distributed, and all weights on edges entering a node are identical. 
Both cases are consistent with the hypothesis of proposition 1, and lead to the situation where 
we can stochastically order the edges entering a node. That is, if N(j) is ordered increasing as 
N(j) = h}, then for h = 2, . . . , k, C(i h - U j) + V{i h - 1 ) < s t C(i h ,j) + V(i/ v ). To see this, 

observe that V(i/ l _i) < s t V(ih ) implies the existence of a random variable Z having the distribution 
of and Z < V(ih)- Under the identified conditions we can choose a weight c having the 

distribution of C{ih-u j), where c < Then c + Z has the distribution of C(ik-i, j) + V(ih-\) 

and c + Z < C(ih,j) + V(i^), establishing proposition 3. 

Proposition 3 Suppose that all edge weights are i.i.d or that for each fixed j, is in- 

dependent of V(i) and C(i,j ) = C(k,j) for all i^k £ N(j)* V N(j) is ordered increasing as 
N(j ) = {e’i, . . . ,u}, then for h = 2 , . . . , k f we have 

C(ih-uj) + V(i fc _i) < s t + ^(4)* 


□ 

Let 


<lh 


= pr {. 


min {C(i,j) + V(i)} = C(ih,j ) + V(ifc) 
e N(j) 


to) j 


In the case of a tie between two or more edges, we define the edge from the least indexed nodes 
as the minimum. If the conditions of proposition 3 are satisfied, an immediate consequence of its 
conclusion is that ?i > <72 > • • • > Qk- Now for any j such that 1 < j < k, 

k 


1 _ i. _ 1 _ h _ 

7 1b ^ ^h- 

J h=l K 


/ 1=1 


Since J2h=i Qh = 1> we have Y^h=i 9* - If follows that if ej is the edge defining V(wj), then 


Pr{ej e No(wj)} > E 

> 


Oi 


Oj + Ij ^ 

E[Oj] E\Oj\ 

E[Oj + /,] D ' 
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The second step follows from Jensen’s inequality^. When w is large relative to P, then E[Oj ] is 
close to D . Such a situation supports our optimistic hope that serialization not be required. 

To avoid serialization we must have ej G No('Wj) for all j . The events that ej G No(u)j) 
and e{ G No(w{) are not independent. Intuitively though, we expect these events to be positively 
correlated — one way for e,- G No(w{) is to have it rooted in a node with an unusually low V value. 

If that node has an edge to wj as well, the probability that is also rooted in that node increases. 

A formal proof of positive correlation appears to be formidable. Assuming positive correlation, the 
probability of serializing the second phase is bounded from above by the probability of serializing 
if the events were independent. This gives 

(5) Pr{ej e N 0 (wj ) for j = 1, . . . , P} > £[ QQl. 

j = i 

We can compute E[Oj ] from (2). By subtracting the product from 1 we obtain an upper bound on 
the probability of serialization. Figure 4 plots this bound for D = 4 when P = 4,8 and 16, as a 
function of w (the same problem set represented in figure (3). Plots for D — 16 and D — 256 are 
indistinguishable from that of D = 4. We see that low serialization probabilities are achieved when 
w is large compared with P. In fact, we next demonstrate that the probability of serialization is 
0(1/P) when w is 0(P 3 ). This is significant, because then the expected complexity of the second 
phase is the same as the complexity if no serialization occurs. 

Proposition 4 Let the conditions of proposition 3 he met , and suppose that w = 0(P 3 ). Then the 
probability of serialization is 0(1/ P). 

Proof; Beginning with (5) we have 

Pr{ej G N 0 (wj) for j = 1, . . . , P} > 


( 6 ) > 

The last step follows from the inequality^: n£=i(l ~ a j) > 1 — a j whenever 0 < dj < 1 for 


•pr E\Oj\ 

1J i D 

j=i 



j = i 
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all j. Note that E[Ij] < E[Ip\ for all j; consulting (3) we see that 

eVp] = + — )p'(r-D 

= 0(1/P 2 ) + 0(D/P 2 ). 

The last step depends on the fact that p e = (D — 1 )/(w — 1), and that w = 0(n 3 ). Picking up 
from (6) we have 

1 p 

Pr{ej G N 0 (wj) for j = 1,...,P} > 1 ~ 

j = 1 

= 1 — 0(1/P) because E[Ip] = 0(D/P 2 ). 

It follows that the probability of serialization is 0(l/P). 

□ 

It is interesting to note that we place no conditions on D in order to achieve a 0(1/ P) probability 
of serialization. 

We strongly suspect that the probability of serialization is low more generally than just under 
the conditions of proposition 3. For example, one intuitively suspects that if edge weights are 
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concave in the edge length, then the edge defining a given V(j ) value will tend to be long, and 
hence rooted outside of V(j) J s block. An algorithmically uninteresting, but extreme case of this 
occurs if edge weights are exactly proportional to edge distance. In this case V( j) = rj for all j, 
where r is the constant of proportionality, and C(i,j ) + V(i) = rj for all i E N(j). Assuming that 
ties are broken by choosing the longest edge, V(j ) is always defined by the edge rooted furthest 
from j. 

4 Computational Experience 

We tested the block window algorithm against two competitors, an iterative, asynchronous “con- 
traction” algorithm, and the straightforward vertical method. We programmed all of these methods 
on the Flex/32^, a shared memory multiprocessor. We find that under our implementations on 
this architecture, the block window algorithm substantially outperforms both of the other methods. 
Furthermore, efficiencies in excess of 50% are achieved when D is sufficiently large relative to the 
number of processors. 

We have already mentioned the vertical method: the edges into each node are partitioned among 
processors, who then cooperatively compute the minimum edge sum. The pseudo-code in figure 5 
describes this algorithm. For each j we denote N(j) = {ji , . . . , jAr}- The synchronization routine 
ComputeMinimum determines the minimum value passed to it, and writes that into V(j). 

Load balancing under the vertical method is good when D is large — at every point no processor 
computes more than one more edge sum than any other. However, the method suffers when D is 
small, and it will always suffer a synchronization cost at each V point. 

An alternate method is based on iterative methods^. We annotate each V value with a super- 
script describing an “iteration number”. We initialize by setting V°(l) = 0, and V°(i) = H for all 
i > 1, where H is the sum of all edge weights, and consequently bounds the true value of V(i) for 
each i. The iterative computation described by 

(7) V k (j) = min {C(i,j) + V k ~ l (i)} for k > 0 

*' € N(j ) 

will converge to the correct solution. This computation uses two arrays for V values, one for 
the “old” (iteration k — 1) values, and one for the new (iteration k). Processor j is responsible 
for computing values V k (j + iP ) for i = 0 ,..., does so in increasing sequence, and synchronizes 
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VerticalMethod(pid) 

/* pid is processor id •*/ 

int pid; 

{ int j, MinEdgeCost ; 

for j = 2 to n 

{ 

MinEdgeCost = minimum value of j) + over all E iV(j), 

/i = pid -f tP for some t\ 

ComputeMinimum(pid, j , MinEdgeCost); 

} 


Figure 5: Pseudo-code for vertical method 


globally after such a pass. This algorithm has the disadvantage of needing multiple passes over the 
V array; it has the advantage of synchronizing only between passes. If L is the minimal number of 
edges in any shortest path from node 1 to node P , then convergence is detected in L + 1 iterations. 
We can improve upon this substantially by allowing asynchrony. We use only one V array, so that 
when computing V(j ), the values V(i) y i E N(j) may actually be a new”. For example, if we use 
only one processor, the true V values are correct after the first pass, and convergence is detected 
on the second. One processor processing of (7) will require L + 1 passes to detect convergence. 

"The pseudo-code in figure 6 describes an asynchronous iterative algorithm. The global synchro- 
nization routine CheckConvergence returns a zero if any processor passed a “Change” value of one 
to it; it otherwise returns a one. 

All of the results we report use n = 1024. We did test larger problems, but found that the 
performance figures were largely unafFected. Four different methods of creating edge weights were 
used. Method 1 chooses each node weight independently, and uniformly from [0,n]. Method 2 
makes the weight sensitive to the edge length, adding j — i to a uniform [0,2 n] random variable. 
Methods 3 and 4 both use uniform [0,n] weights; Method 3 ensures that edge weights rooted in a 
node increase as a function of their destination. Method 4 forces all weights on a node’s incoming 
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IterativeMethod(pid) 

/* pid is processor id */ 

int pid; 

{ int j, MinEdgeCost, Converged, Change; 

Converged = 0; 

/* loop through another pass */ 

while( Converged = 0) 

{ /* pid does every Pth point */ 

j = pid; 

Change = 0; 
while (j < n ) 

{ MinEdgeCost = minimum value of C(i, j) + V'(i) over all i £ N(j)\ 

/* look for improvement in solution -*/ 

If (MinEdgeCost < V(j)) 

{ V(j) — MinEdgeCost; 

Change — 1; 

} 

3= j + p \ 

} 

/* synchronize with others, check convergence */ 

Converged = CheckConvergence(pid, Change); 

} 

} 


Figure 6: Pseudocode for iterative algorithm 

edges to be identical. All of these methods satisfy the hypothesis of proposition 2; only Methods 
1 and 4 satisfy the hypothesis of proposition 3. We take w = 64 and w = 256; we also vary the 
average indegree, taking D = 8,16,32,48,64,96,128, subject to D < w. 

Early evaluation of the algorithms revealed that the execution times of both the block window 
and vertical methods on problems this large are highly insensitive to the particular edges and edge 
weights, and are insensitive to the different methods we used of generating edge weights. Some law 
of large numbers seems to keep the measured coefficient of variation (standard deviation divided 
by mean) well under 0.01 on ten replications. By comparison, execution time variation for the 
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iterative method was large. The asynchrony allows different runs (even on the same problem) to 
have different convergence rates. 

For every set of graph characteristics (D,w, weight generation method) we generate four inde- 
pendent graphs, and solve the shortest path problem with each algorithm, using 2,4,8, and 16 
processors. Each of our performance graphs plots the average efficiency as a function of D and the 
algorithm used — measurements from different problem types but with common and algorithm 
characteristics are averaged together. Figure 7 shows four plots of measured mean efficiencies, when 
w — 256. Each graph reports measurements from a fixed set of processors. Similar tests were run 
using w = 64, but surprisingly very little difference was observed, at least for the vertical and 
window block methods. Measurements from w = 256 tend to be better, but only marginally so. 
The iterative method does somewhat better on w — 256, because the number of edges in a shortest 
path tends to be smaller. 

Three trends in this data are interesting. First and foremost, performance of the block window 
method increases as the graph density ( D ) increases, and is often quite respectable. On the given 
implementations, architecture, and problem set, the other algorithms are clearly inferior. Second, 
better efficiencies are achieved using a smaller number of processors. For a fixed problem size 
it is almost always the case that increasing the number of processors decreases the efficiency, 
because more processors usually imply more overhead. The third trend is that the vertical method’s 
effectiveness decreases as the number of processors increases. This is due to the fact that the cost 
of synchronization increases as the number of processors increases, as does relative load imbalance. 

5 Summary 

This paper proposes the block window algorithm for solving sparse dynamic programming prob- 
lems on a parallel computer. Dynamic programming problems are characterized by their use of 
subproblem solutions to construct “larger” problem solutions. Sparse problems tend to force a 
completely serial execution if, when constructing a larger problem solution, we always wait for the 
solutions to all possible subproblems that might be needed. The key idea behind our algorithm 
is to optimistically assume in a first phase that the particular subproblem solution that will be 
needed is one that has already been computed. A second phase checks this assumption’s veracity, 
and corrects any erroneous calculations. 
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Figure 7: Measured efficiencies for different algorithms 
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The bulk of this paper analyzes the method’s performance quantitatively and qualitatively. The 
analysis gives us insight into the type of performance we can expect, depending on problem and 
architectural characteristics. We then compare its performance against two other algorithms on a 
shared-memory multiprocessor, observe that on the given problem set it performs markedly better 
than the others, and note that efficiencies in excess of 50% can be achieved using sixteen processors. 
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